Databases for use with analytical chemistry
instrumental techniques are surveyed,
with attention to existing
databases and collection efforts now un- 
derway, as well as needs for new data- 
bases. Collections of spectra for use in 
NMR, infrared spectroscopy, and mass 
spectroscopy are described. Using mass 
spectral databases as an example, a cri- 
tique is presented of automated quality 
control procedures used to evaluate in- 
dividual spectra in large collections; the 
kinds of problems which have been en- 
countered in using these procedures are 
discussed. Finally, a brief critical review
is presented covering the application of
computers to the identification of un- 
known compounds using spectral data- 
bases; again, algorithms used with mass 
spectrometry are taken as the example. 
Ongoing work at NIST with the NIST/ 
EPA/MSDC Mass Spectral Database is 
concerned with many of these problems; 
recent developments are described. 



Key words: analytical chemistry; com- 
puter; database; evaluation; infrared 
spectrum; mass spectrum; nuclear mag- 
netic resonance. 



1. Introduction 



In principle, the measurement technique in
which spectroscopy is used as an analytical tool 
involves obtaining a spectrum of the sample of in- 
terest (the "unknown") and identifying the un- 
known compound by the similarity of its spectrum 
to that of a particular ("known") chemical com- 
pound. Here we use the word "spectroscopy" in 
the broadest possible sense; spectroscopy is taken 
to be any experimental technique which provides a 
reproducible "spectrum" characteristic of particu- 
lar chemical species. This includes, for example, all 
optical spectroscopy, nuclear magnetic resonance, 
electron spin resonance, mass spectrometry, and so 
on. 

Of course, from the beginning of the use of spec- 
tral techniques in the analytical laboratory, it was 
recognized that the comparison spectra need not be 
obtained at the same time, or even on the same
instrument, as the analysis itself. Because one could
collect standard spectra and use them over and 
over again, it is not unexpected to find that there is 
a long history of data collection efforts aimed at 
analytical applications [1,2]. With the beginning of 
the computer age, it was of course a natural exten- 
sion of these activities to store spectral databases 
on computers, and to conduct automated searches 
of those databases in order to "match" the spec- 
trum of the unknown compound with that of a 
standard reference compound. The use of auto- 
mated instruments equipped with reference li- 
braries has become a well-established measurement 
technique for analytical chemistry. At the present 
time, computerized algorithms are also used to 
evaluate the large numbers of spectra which com- 
prise these collections. 

In spite of the long history of data collection efforts
involving analytical spectra, there is some dissatisfaction
with the size and quality of available
collections. For example, in 1986 Thomas L. Isen- 
hour wrote an editorial [3] describing the consen- 
sus of experts concerning computerized databases 
for use in analytical chemistry measurement tech- 
niques: "... the current state of spectroscopic data- 
bases is such that it inhibits good applications of 
known search and interpretive procedures as well 
as further research on these methods. ... we do not 
in general have high quality spectroscopic data- 
bases available... Perhaps 10 million chemical com- 
pounds are now known. Some measurements have 
been made on all of them. Very few, if any, struc- 
ture identifications have been made in recent times 
without resorting to some form of spectroscopy. 
Why then are the largest available spectral data 
files in computer format limited to a few tens of 
thousands of compounds?" 

This paper presents a brief survey of the use of 
automated databases as an integral part of spectro- 
scopic measurement techniques for analytical 
chemistry, with emphasis on mass spectrometric 
databases. Because of the rather dim view by the 
experts of the analytical databases in common use, 
the survey includes a list of the most popular auto- 
mated analytical databases with attention to the 
numbers of spectra available in each of them. A 
discussion of the current state of automated evalua- 
tion algorithms being used with mass spectral data- 
bases is included. 

Ongoing work aimed at updating and improving 
the quality of the mass spectrometric database dis- 
tributed by the National Institute of Standards and 
Technology Office of Standard Reference Data is 
described. 



2. Brief Survey of Automated Analytical Databases

2.1 Nuclear Magnetic Resonance Spectroscopy (NMR)

The databases listed below are all provided with 
software which enables the user to look up particu- 
lar spectra or to match the characteristics of a par- 
ticular spectrum of an unknown compound. Most 
NMR databases also include software for spectrum 
estimation and interpretation. 
2.1.1 C-13 NMR Database on the Chemical Infor- 
mation System The Chemical Information System 
[4] collection currently consists of a total of 11,700
¹³C NMR spectra. The database was last updated in
November 1985, when many incorrect assignments 
in older spectra were corrected, and over 4,000 
new spectra were added. The database was origi- 
nally put together by the Royal Dutch Chemical 
Society (also called Netherlands Information Com- 
bine). 

2.1.2 C-13 NMR Online Service of the Fachinfor- 
mationszentrum (FIZ), Karlsruhe, W. Germany 
(accessed in the U.S. through STN International) 
This widely-used NMR database was added to the 
STN system [5] in December 1987, having been 
marketed previously in the U.S. by Scientific Infor- 
mation Service (SIS). The collection contains 
67,500 ¹³C chemical shifts, coupling constants, and
relaxation times. 

2.1.3 Bruker Spectroscopic Database This data- 
base is available to Bruker customers for on-site 
use. It requires a Bruker Aspect 2000 or 3000 com- 
puter together with a Bruker software package 
(BASIS — Bruker Automatic Spectroscopy Inter- 
pretation System). The database contains various 
modules, including ¹³C NMR (19,000 spectra), ¹H
NMR (900 spectra), as well as a combined ¹³C NMR-MS
database.

2.1.4 Sadtler Laboratories This database consists 
of 24,000 sets of ¹³C NMR chemical shifts with
compound names, and also 10,000 ¹³C NMR spectra
in full digital format that can be used to view
expanded displays of the spectra [1c]. The database
is designed for use with Sadtler's own ¹³C search
software package, which operates on IBM-compat- 
ible personal computers. 

2.1.5 Collection of National Chemical Laboratory 
for Industry, Japan The integrated online "Spectrum
Database System" [6], which includes collections
of NMR, ESR, IR, Raman, and mass spectra,
has both ¹H NMR spectra (6,000 compounds) and
¹³C NMR spectra (5,700 compounds) along with
search software enabling a user to look up a partic- 
ular spectrum (and conditions under which it was 
run) or to match an unknown spectrum. All spectra 
were determined at the NCLI under carefully con- 
trolled conditions. 

2.1.6 Other Collections of NMR Spectra The list 
given above is not exhaustive. For example, Varian 
also markets an NMR database, and Tsukuba Uni- 
versity (Japan) produces a CD-ROM collection of 
¹³C NMR spectra of polymers. The data came from
existing handbooks. The system also contains pro- 
grams to synthesize the NMR spectra from struc- 
tural information. 

2.2 Infrared Spectra (IR)

In the field of infrared spectroscopy, many large 
collections of spectra were built up [1,7] at a time 
when the spectrometers in use were prism and 
grating instruments. Within the past decade, the in- 
strumentation in general use in analytical laborato- 
ries has changed to Fourier transform infrared 
spectrometers (FT-IR), which generate digitized 
spectra. Although the older analogue spectra can 
be digitized to be made compatible with the data 
systems of the newer instruments, questions have 
been raised about the desirability of doing this. In 
the opinion of some experts [8], many of the older 
collections of spectra are no longer adequate to 
serve as reference spectra for comparison with re- 
sults taken on the newer instruments. For this rea- 
son, effort has been given recently to building 
completely new collections of IR spectra which 
were generated in digital format in FT-IR instru- 
ments. In the discussion which follows, attempts 
will be made to distinguish between the newer dig- 
itized collections, and databases of spectra from 
prism and grating spectrometers. 

2.2.1 Aldrich-Nicolet Digital FT-IR Database and 
the Sigma-Nicolet Biochemical Library Nicolet, in 
collaboration with Aldrich and with Sigma, is pro- 
ducing high quality databases of FT-IR spectra of 
the compounds in the catalogues of these two com- 
panies. The Aldrich-Nicolet collection contained 
10,600 compounds and the Sigma-Nicolet collec- 
tion, 10,400 compounds in 1987. These databases 
are being updated in 1988 with the addition of sev- 
eral thousand new spectra. The databases are de- 
signed for use on several popular personal 
computers, and are distributed with software 
which is geared to locating spectra which match 
the peak intensities and locations from an IR spec- 
trum of an unknown substance. 

2.2.2 Sadtler Research Laboratories Spectra This is the
largest commercially available collection of infrared
spectra [1c], with more than 60,000 spectra, largely
from prism and grating spectrometers. The current
collection also includes some FT-IR spectra. 

2.2.3 Coblentz Society Spectra Beginning in the 
mid-1960s, the Coblentz Society, in collaboration 
with the Joint Committee on Atomic and Molecu- 
lar Physical Data (JCAMP), put together a collec- 
tion of 10,500 donated infrared spectra taken on 
prism and grating spectrometers. The effort in- 
cluded developing evaluation procedures for IR 
spectra, and evaluating the entire collection of 
spectra. The collection was originally distributed 
in 10 volumes in a looseleaf notebook format [7]. 



Recently, 4,400 of these spectra have been digi- 
tized, and will be made available through the 
Coblentz Society, which is also digitizing the re- 
maining spectra. Dr. Clara Craver, of the Chemir 
Labs, who played a key role in putting together the 
original Coblentz Society collection, is actively so- 
liciting donations of new spectra to increase the 
size of the database, which will be available in a 
format for use with personal computers. 

2.2.4 EPA Vapor Phase Spectra This collection 
of 3,300 spectra originated in laboratories of the 
EPA, and is in the public domain. Although not 
commercially available as a collection, the spectra 
are available through the instrument companies 
manufacturing IR spectrometers. 

2.2.5 Collection of the University of California-Riverside
"Clearinghouse for Digital Infrared Spectra"
A new project was initiated in October 1986
for the collection of a database of digitized FT-IR 
spectra under the leadership of Drs. Peter Griffiths 
and Charles Wilkins at the University of Califor- 
nia-Riverside. They hope to tap several collections 
of high quality digital spectra measured in various 
analytical laboratories for internal use. This team 
has put together an automated algorithm for evalu- 
ating the spectra of this collection [8]. 

2.2.6 Infrared Data Committee of Japan (IRDC) 
This organization has distributed IR spectra in 
printed form on edge-punched cards since 1961 
[1d]. About 19,000 cards are now available. In
1980-85, the peak wavenumbers and intensities 
were extracted and entered into a computer file. 
Search software for the database has been pre- 
pared. A search involves entering wavenumbers 
and intensities in order of decreasing intensity; no- 
band regions can be specified. Spectra which are 
retrieved in a search are listed in order of the prob- 
ability of being a correct match. The publisher of 
the IRDC cards is also marketing the above system 
in magnetic tape form. The possibility of fully digi- 
tizing the IRDC spectra has been discussed, but no 
decisions have been made. 
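
A peak-table search of the kind described for the IRDC file can be sketched as follows. This is only an illustration under stated assumptions: the function name `search_ir_library`, the matching tolerance, and the scoring rule (fraction of query peaks matched) are hypothetical and are not taken from the IRDC software, whose ranking by probability of a correct match is not specified here. Intensities are ignored in this sketch.

```python
def search_ir_library(library, query_peaks, no_band_regions=(), tol=10.0):
    """Rank library entries against a list of query peak wavenumbers.

    library: dict mapping a compound name to a list of peak wavenumbers (cm^-1).
    query_peaks: wavenumbers observed for the unknown.
    no_band_regions: (low, high) ranges in which the unknown shows no band.
    tol: assumed matching tolerance in cm^-1 (hypothetical value).
    Returns (name, score) pairs sorted by decreasing score.
    """
    if not query_peaks:
        raise ValueError("at least one query peak is required")
    hits = []
    for name, peaks in library.items():
        # Reject entries that have a band inside a region declared empty.
        if any(low <= p <= high for p in peaks for low, high in no_band_regions):
            continue
        # Count query peaks that have a library peak within the tolerance.
        matched = sum(
            1 for q in query_peaks if any(abs(q - p) <= tol for p in peaks)
        )
        hits.append((name, matched / len(query_peaks)))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy library with made-up entries, for illustration only.
toy_library = {"a": [1700.0, 2950.0], "b": [1700.0, 3400.0], "c": [1100.0]}
ranked = search_ir_library(toy_library, [1705.0, 2955.0], [(3200.0, 3600.0)])
print(ranked)
```

In this toy case entry "b" is excluded because it has a band in the declared no-band region, and the remaining entries are listed with the best match first.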

2.2.7 Collection of National Chemical Laboratory 
for Industry, Japan The integrated online "Spec- 
trum Database System" [6], which includes collec- 
tions of NMR, ESR, IR, Raman, and mass spectra, 
also makes available a database of 22,500 infrared 
spectra. All spectra were determined at the NCLI 
under carefully controlled conditions. Data were 
transferred in digital form directly from the FT-IR 
instrument on which they were determined to the 
database. The database is available online to users 
in Japan. 

2.2.8 American Society for Testing and Materials
Collection Comprehensive indices coded by
ASTM Committee E-13.03 for the infrared spectra
from most of the older general collections are 
available from Chemir Labs, Sadtler Research 
Labs, and on-line on the Canadian Scientific Nu- 
meric Data System. Data for 145,000 compounds 
are included. 

2.2.9 Other Collections Many hard-copy collec- 
tions of IR spectra exist. For a comprehensive list 
of the numerous older collections, the reader is re- 
ferred to the bibliography given in The Coblentz 
Society Desk Book of Infrared Spectra [9]. Para- 
graphs 2.2.3 and 2.2.5 describe new collection ef- 
forts aimed at the production of computerized IR 
databases. In addition, there are apparently several 
similar efforts now being initiated in Europe, nota- 
bly at the University of Essen [10]. 

2.3 Mass Spectra 

2.3.1 The Wiley Registry of Mass Spectral Data 

This collection has been put together and is main- 
tained by F. W. McLafferty at Cornell University. 
The database, available from John Wiley & Sons, 
Inc. on magnetic tape or in a CD-ROM version, 
contains 123,704 spectra of 108,173 compounds 
evaluated using a Quality Index algorithm [11] (see 
discussion below). Replicate spectra of a given 
compound are included. The magnetic tape version 
is distributed without search software, although 
software for matching unknown spectra which is 
tailored to this database is available free of charge 
from Cornell University [12-17]. 

2.3.2 The NIST/EPA/MSDC Mass Spectral 
Database This database was originally put to- 
gether by Drs. S. R. Heller and G. W. A. Milne of 
EPA and NIH, and called the EPA/NIH Mass 
Spectral Database. Since 1978, this database has 
been jointly administered by NIST and EPA, and 
new spectra are identified in the published litera- 
ture, collected in complete form from the original 
authors, and evaluated by the Mass Spectrometry 
Data Center (MSDC), Nottingham, England. The 
current database consists of 43,005 spectra, each 
one corresponding to a unique chemical com- 
pound. Spectra in the current version of the data- 
base were selected from an archive of 79,000 mass 
spectra and evaluated using a Quality Index al- 
gorithm, based on — but not exactly the same as — 
the algorithm developed by F. W. McLafferty to 
evaluate the Wiley database [18,19]. (The Quality 
Index evaluations are discussed in detail in sec. 4.) 



The database is distributed on tape without search 
software, and in a PC version with search software 
and elementary matching software. A new update, 
which will include several thousand new spectra, is 
being prepared for release in the fall of 1988. The 
corresponding PC-version will incorporate struc- 
tural information on all compounds in the database, 
as well as several new modes of matching spectra 
of unknown compounds to spectra in the database. 

2.3.3 The Merged Wiley/NBS Registry of Mass 
Spectral Data The Wiley and NBS/EPA/MSDC 
collections are also available from John Wiley & 
Sons in a merged version, which has a total of 
130,544 spectra (number of duplicate spectra in the
two databases, 36,847). The merged database is 
available on tape and CD-ROM. A book version of 
the Merged Database is being published [20]. 

2.3.4 The Eight Peak Index The primary publi- 
cation of the Mass Spectrometry Data Center, 
(Royal Society of Chemistry, Nottingham, Eng- 
land) is made up of a set of seven volumes [21] 
including 65,000 eight-peak spectra of 52,332 com- 
pounds indexed by molecular weight, chemical for- 
mula, and most abundant ions. This collection of 
partial spectra is also available on tape. The collec- 
tion includes many of the same spectra included in 
the Wiley and NBS/EPA/MSDC collections. All 
of these collections of mass spectra have been put 
together incorporating older (non-computerized) 
data collections such as the spectra from the API 
Project 44 [1a], the Thermodynamics Research
Center [1b], and the American Society for Testing
and Materials (ASTM) [1e].

2.3.5 Collection of National Chemical Laboratory 
for Industry, Japan The integrated online "Spec- 
trum Database System" [6], which also includes 
collections of NMR, ESR, IR, and Raman spectra 
has a database of 10,000 mass spectra which were 
determined in the NCLI laboratories as part of the 
larger project. The system was recently made 
available to users in Japan. 

2.3.6 Other Collections (a) Japan Information 
System for Science and Technology (JICST) has 
an online "Mass Spectral Database System" 
searchable by name, formula, Chemical Abstracts
Registry Number, and peaks. This system uses the
NIST/EPA/MSDC database augmented by a collection
of 6,000 spectra from the Mass Spectrometry
Society of Japan. (b) Dr. D. Henneberg
(Max-Planck-Institut für Kohlenforschung) has a
collection of approximately 12,000 spectra, to 
which he is adding with the intention of building a 
database [22]. 

3. Methods of Building Spectral Collections

While the above lists make it clear that many
collections of spectra for use in analytical chem- 
istry laboratories are available, it is also evident 
that Thomas Isenhour's complaint [3] that none of 
the collections contain more than about 100,000 
spectra is also substantially correct. In order to un- 
derstand why the sizes of available collections are 
so small even after several decades of collection 
effort (even excluding infrared spectroscopy, 
where earlier collections became less useful with 
the advent of new instrumentation), it is of interest 
to examine the techniques which are commonly 
used to collect spectra for such databases. This dis- 
cussion will also consider how the nature and qual- 
ity of a database is influenced by the way in which 
it has been put together. 

3.1 Laboratory Efforts 

The analytical chemistry databases listed above 
include several examples of collections which have 
been put together in a single laboratory by system- 
atically determining spectra of large numbers of 
chemical compounds for the specific purpose of 
building a database. The high quality collections of 
infrared spectra of compounds from the Aldrich 
and Sigma catalogues put together by Nicolet, an 
instrument manufacturer, are an example of this ap- 
proach. 

Another example is the integrated database sys- 
tem put together by the National Chemical Labo- 
ratory for Industry (Japan), which includes mass 
spectra, IR spectra, 'H and "C NMR spectra, as 
well as ESR and Raman spectra, all determined in 
the NCLI laboratories under carefully controlled 
conditions [6]. In addition to providing an excellent 
example of a carefully constructed collection of 
spectra, this system also is perhaps the most fully 
realized example of a trend which will undoubt- 
edly become important in the future — the use of 
integrated databases incorporating more than one 
kind of spectrum. 

Databases put together under this strategy are 
generally of high quality, since the purity of the 
compounds used as well as the instrument parame- 
ters can be controlled by the party building the 
database. In the case of the integrated database, 
there is the further advantage that the correctness 
of the data can be cross-checked by examining 
complementary information obtained from differ- 
ent techniques. 



In spite of the obvious advantages of this ap- 
proach, however, it must be admitted that this type 
of database-building effort is expensive and rela- 
tively slow. The NCLI effort, for example, has re- 
quired support for a laboratory effort including IR, 
mass spectral, and NMR instrumentation during 
approximately the past dozen years; the overall 
database index now contains 17,000 compounds [6]. 
The Thermodynamic Research Center at Texas 
A&M University sponsors a collection effort 
through laboratory measurements which generates 
about 75 spectra per year; again, the quality of the 
spectra is excellent, but one could never hope to 
build a large database by adding spectra at this rate. 

3.2 Collections Put Together through Donations of 
Spectra from Diverse Laboratories 

Many of the collections listed above have been 
put together by soliciting donations of spectra from 
many different laboratories. The Coblentz Society 
collection of IR spectra [7] and the American 
Petroleum Institute (API) Project 44 [1a] collections
of several kinds of spectra are examples of
successful efforts of this nature. This approach has 
the obvious advantage that when a cooperative 
pool of donors exists, a database can be built rela- 
tively quickly and inexpensively. 

On the other hand, when spectra are obtained 
from many different laboratories, there will in- 
evitably be large variations in the quality of the 
data, not to mention differences in spectra due to 
the use of instruments of varying design. For exam- 
ple, the mass spectral collections include spectra 
from both magnetic sector and quadrupole instru- 
ments, which may have different types of mass dis- 
crimination, and therefore may give slightly 
different spectra for the same compound. How- 
ever, the main problem associated with this collec- 
tion technique is that completion of a collection 
project necessarily depends on the labor of volun- 
teers. In general, the most successful efforts have 
been made when the management of a laboratory 
made the database collection a high priority work 
item (such as the petroleum industry's generation 
of the API Project 44 Collection). When the effort 
is purely voluntary — something which is done only 
when other (high priority) work assignments have 
been completed — experience has demonstrated that 
the time-consuming task of preparing data for 
transfer to a collection is rarely actually under- 
taken. 

3.3 Collection of Data from the Literature

A large number of scientific databases are com- 
posed by abstracting data from the scientific litera- 
ture. This approach can also be applied to the 
construction of a spectral database for analytical 
use, thus obviating the need for achieving coopera- 
tion from donors of spectra. The most successful 
example of this type of database is the Wiley Reg- 
istry of Mass Spectra, put together by F. W. 
McLafferty at Cornell University. As a result of 
the incorporation of spectra from the open litera- 
ture, the database has grown dramatically in recent 
years, achieving as noted above a size of 123,704 
spectra, up by about 50,000 over a period of some 
four or five years. 

The database one obtains using this strategy, 
however, has a somewhat different nature from the 
databases built up through dedicated laboratory 
measurements or donations of spectra directly from 
the laboratories in which they were measured. The 
spectral data reported in scientific papers are often 
incomplete, either because the journals do not have 
sufficient space to publish entire spectra, or be- 
cause the determination of a spectrum was not the 
primary motivation of the work reported in the lit- 
erature. Therefore, a database built up with a large 
component of spectra from the scientific literature 
will include mainly partial spectra. The mean size 
of a mass spectrum in the Wiley Registry is 29 
peaks, which can be compared with the mean size 
of the spectra in the NIST/EPA/MSDC Mass 
Spectral Database, 60 peaks (i.e., the mean size of 
the spectra taken from the literature is 13 peaks/ 
spectrum). 



4. Automated Evaluation of Spectral Col- 
lections 

Spectral collections which are put together by 
laboratories which determine each individual spec- 
trum are evaluated as they are built, and should not 
require much additional evaluation. However, 
when spectra come from a variety of sources, 
through donation schemes or literature acquisition, 
it is important to determine the quality of the spec- 
tra, and when a collection contains more than a 
few thousand spectra, it is obviously advantageous 
to have schemes whereby the spectral quality can 
be examined in some automated fashion. Such an 
approach to the evaluation of infrared spectra has 
recently been reported; a scheme was developed especially
for use with the University of California-Riverside
"Clearinghouse for Digital Infrared
Spectra" [8]. Since this scheme is new, however, 
few details are available about its successes and/or 
failures when used with an actual database. 

An automated evaluation scheme for mass spec- 
tra has been in use for many years, and the success- 
ful use of automated algorithms, as well as the 
kinds of problems which have been encountered, 
can be documented. The so-called Quality Index 
algorithm for mass spectra was originally proposed 
in 1978 by Speck, Venkataraghavan, and McLafferty
[11], who put together an automated examination
of various factors a trained mass
spectrometrist would use in evaluating spectral 
quality. These included: (1) energy of the ionizing 
electrons; (2) presence of peaks at masses higher 
than the molecular weight of the compound; (3) 
presence of "illogical" peaks, which would not 
normally be formed in a compound of a particular 
formula; (4) whether or not relative isotopic abun- 
dances were correctly represented in the spectrum; 
(5) the total number of peaks in the spectrum (a 
measure of the completeness of the spectrum); (6) 
the mass of the lowest peak reported in the spec- 
trum (another measure of completeness); and (7) 
the source of the spectrum. 

Each factor was associated with a simple equa- 
tion designed to give a numerical grade ranging 
from 0 to 1. For example, the so-called Quality
Factor for the low mass limit was assigned by examining
the mass of the lowest peak reported in the
spectrum ($M_{\min}$) and comparing it to the molecular
weight of the compound ($MW$), using the equation:

(and QF was taken to be 1.0 for all compounds 
with molecular weight lower than 40). The final 
Quality Index (QI) for the spectrum was arrived at 
by multiplication: 

$$QI = QF_1 \cdot QF_2 \cdot QF_3 \cdot QF_4 \cdot QF_5 \cdot QF_6 \cdot QF_7 \cdot (1000).$$

Note that since the various factors are multiplied 
(rather than added) to achieve the final grade for a 
spectrum, a value of zero or a very low value for 
any single factor will lead to a low value for the 
spectrum as a whole. Furthermore, a spectrum 
receiving a rather high grade, but a grade less 
than unity, for each of the seven factors will 
end up with a low Quality Index value; 
$(0.95)^7 \times 1000 = 698$.
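
The effect of multiplying the factors can be checked with a few lines of code. This is a minimal sketch, not the Cornell implementation: the function name `quality_index` is hypothetical, and only the product rule, the 0 to 1 range of each factor, and the scale factor of 1000 come from the description above.

```python
from math import prod


def quality_index(quality_factors, scale=1000):
    """Multiply Quality Factors (each between 0 and 1) and scale the product."""
    for qf in quality_factors:
        if not 0.0 <= qf <= 1.0:
            raise ValueError(f"quality factor out of range: {qf}")
    return scale * prod(quality_factors)


# Seven factors of 0.95 each already pull the overall grade down to about 698.
print(round(quality_index([0.95] * 7)))  # 698

# A single factor of zero forces the whole index to zero.
print(quality_index([1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]))  # 0.0
```

Because the grade is a product, no single strong factor can compensate for a weak one, which is the behavior the text points out.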

The same approach was used by scientists 
putting together the NIST/EPA/MSDC Mass 
Spectral Database [18,19], who omitted the seventh
Quality Factor listed above (the source of the spec- 
trum), and added some additional factors, namely: 
(1) stated sample purity; (2) whether or not the 
mass spectrometer has been calibrated for the mea- 
surement, and, if it has, the availability of the cali- 
bration data; (3) the presence of a peak at mass 28 
(taken as evidence for the presence of air); (4) evi- 
dence for detector saturation; and (5) if the spec- 
trum does not contain a peak having a mass equal 
to the molecular weight, the highest mass peak 
which is included (again, an indicator of the com- 
pleteness of the spectrum). 

In addition, many of the algorithms originally 
formulated by the Cornell team were modified for 
use with the NIH/EPA database. The modifica- 
tions were based largely on analyses of the statistics 
of the individual quality factor values obtained for 
the spectra in the database. That is, it was assumed 
that (a) the standard deviation of the values ob- 
tained for any Quality Factor should be roughly 
proportional to the spectral significance of the 
property being measured; (b) the mean value of any 
given Quality Factor calculated for all the spectra 
in the database, should be 0.9 or greater, and (c) a 
Quality Factor should have a value of zero only in 
extreme cases. This is another way of saying that 
any Quality Factor which penalizes essentially all 
spectra in the database, or very few spectra, is not 
giving us any useful information for distinguishing 
between poor and good quality spectra. Thus, the 
modifications generally involved changing the 
equations to make the penalty greater or smaller, 
depending on the statistics observed. For example, 
the "low mass limit" Quality Factor given above 
was found to be weighted too strongly, and was 
modified to: 

$$QF = \left[\frac{MW - M_{\min}}{MW - 29}\right]^{1/2} \quad \text{for } MW < 179,$$

and

$$QF = \left[\frac{MW + 179 - 2M_{\min}}{MW + 179 - 58}\right]^{1/2} \quad \text{for } MW > 179.$$
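
The piecewise form of the modified low-mass-limit factor can be sketched in code. Several things here are assumptions rather than statements from the text: the function name `low_mass_quality_factor` is hypothetical, the exponent is exposed as a parameter with an assumed default of 1/2, the ratio is clamped to the range 0 to 1, and the rule that QF is 1.0 below a molecular weight of 40 (stated for the original equation) is carried over unchanged.

```python
def low_mass_quality_factor(mw, m_min, exponent=0.5):
    """Low-mass-limit Quality Factor, piecewise in molecular weight.

    mw: molecular weight of the compound.
    m_min: mass of the lowest peak reported in the spectrum.
    exponent: ASSUMED value; treat as a placeholder.
    """
    if mw < 40:
        # Assumption: the MW < 40 rule of the original factor is kept.
        return 1.0
    if mw <= 179:
        ratio = (mw - m_min) / (mw - 29)
    else:
        ratio = (mw + 179 - 2 * m_min) / (mw + 179 - 58)
    # Assumption: keep the grade inside the 0 to 1 range used for all factors.
    ratio = min(max(ratio, 0.0), 1.0)
    return ratio ** exponent


# The two branches agree at MW = 179, so the factor is continuous there.
print(low_mass_quality_factor(179, 60))
print(low_mass_quality_factor(179.000001, 60))
```

A spectrum recorded down to mass 29 receives the full grade in either branch, since the numerator and denominator then coincide.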

In the NIST/EPA/MSDC database, until re- 
cently the protocol for putting together the data- 
base from the larger archive of spectra involved (1) 
calculating the Quality Index (QI) value for each
spectrum in the system; (2) when there was more 
than one spectrum of a given compound, selecting 
from among those spectra by taking the one with 
the highest QI value for inclusion in the database. 
The spectra were not at any time examined visually 



by a mass spectrometrist; all judgements and selec- 
tions were made using the automated procedure. 
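
The two-step selection protocol just described amounts to keeping, for each compound, the replicate with the highest QI. A minimal sketch follows; the record layout (dictionaries with `compound_id` and `qi` keys) and the function name `select_best_spectra` are hypothetical and chosen only for illustration.

```python
def select_best_spectra(archive):
    """Return one spectrum per compound: the replicate with the highest QI.

    archive: iterable of dicts, each with 'compound_id' and 'qi' keys
    (an assumed layout). Ties keep the first spectrum encountered.
    """
    best = {}
    for spectrum in archive:
        cid = spectrum["compound_id"]
        if cid not in best or spectrum["qi"] > best[cid]["qi"]:
            best[cid] = spectrum
    return best


# Made-up archive entries, for illustration only.
toy_archive = [
    {"compound_id": "X", "qi": 612, "source": "lab 1"},
    {"compound_id": "X", "qi": 698, "source": "lab 2"},
    {"compound_id": "Y", "qi": 455, "source": "lab 3"},
]
selected = select_best_spectra(toy_archive)
print(selected["X"]["source"])  # lab 2
```

Since nothing but the QI value enters the choice, any systematic error in a Quality Factor propagates directly into the selected database.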

In general, the calculation is very effective in 
choosing between good spectra and poor spectra. 
However, in the 1986 edition of the database, it 
was noted that there were instances in which the 
algorithm led to the selection of a poor spectrum 
over several good spectra. In other cases, good 
spectra were found which had been assigned very 
low Quality Index values. 

An analysis was made to identify the factors con- 
tributing to the observed problems. It was found, 
for instance, that spectra legitimately containing a 
large peak at m/z 28 were receiving low ratings 
because of the identification of that peak with the 
presence of air; the algorithm was modified to re- 
quire the simultaneous presence of m/z 28 and m/z 
32 with a ratio approximately the same as that one 
would observe for an air sample. Some of the frag- 
mentation processes considered by the algorithm to 
be "illogical" were found to be important for cer- 
tain types of compounds; as a result, all of the spec- 
tra of these compounds were receiving very low 
Quality Index values. For example, the "illogical 
loss" algorithm penalized all spectra in which there 
was an ion 2 mass-units below the parent molecular 
ion, that is, in which there was a fragmentation 
process consisting of a loss of H2 (or 2 H-atoms) 
from the molecular ion. This dissociation is very 
important for low molecular weight alkanes, and 
all alkane spectra were heavily penalized. The most 
abundant ion in the mass spectrum of ethane is at 
m/z 28 (C2H4+), and results from an "illogical loss" 
of two mass units, and therefore all spectra of 
ethane had Quality Index values of zero. 
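
The revised air test described in this paragraph, which requires m/z 28 and m/z 32 together in roughly the ratio seen for air, can be sketched as below. The function name `looks_like_air`, the reference ratio, and the tolerance are all placeholders assumed for illustration; the values actually used in the database software are not given in the text.

```python
def looks_like_air(peaks, expected_ratio=4.0, tolerance=0.5):
    """Flag probable air contamination in a mass spectrum.

    peaks: dict mapping integer m/z to intensity.
    expected_ratio: ASSUMED intensity ratio of m/z 28 to m/z 32 for air.
    tolerance: ASSUMED allowed fractional deviation from that ratio.
    """
    i28 = peaks.get(28, 0.0)
    i32 = peaks.get(32, 0.0)
    # Both peaks must be present; m/z 28 alone is no longer penalized.
    if i28 <= 0 or i32 <= 0:
        return False
    ratio = i28 / i32
    return abs(ratio - expected_ratio) <= tolerance * expected_ratio


# An ethane-like spectrum with a large m/z 28 peak and no m/z 32 is not flagged.
print(looks_like_air({28: 100.0, 30: 26.0}))  # False
# Peaks at 28 and 32 in roughly the assumed air ratio are flagged.
print(looks_like_air({28: 80.0, 32: 20.0}))   # True
```

Requiring both peaks is what protects compounds that legitimately produce a strong ion at m/z 28.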

Appropriate modifications to the algorithms 
were carried out, and the database was regener- 
ated. The archive contains some 16,000 spectra 
which are replicates; the new calculation resulted 
in the replacement of 620 spectra by other spectra 
from the archive. A visual examination of these 620 
pairs revealed that 50% of the changes had re- 
sulted in the selection of a spectrum of lower qual- 
ity than that originally included in the database. 
Some of these replacement pairs are shown in fig- 
ures 1-3. 

Figure 1 shows two mass spectra of HBr. At the 
last revision of the Quality Index calculation, the 
spectrum on the top replaced the spectrum on the 
bottom which contains HCl impurity peaks and so 
much water that m/z 18 is the major peak. Note 
that although the current algorithm results in the
choice of the better spectrum, the difference in the
QI values between the good spectrum and the very
bad spectrum is only 32 points.

Figure 1. The mass spectrum of HBr shown on the bottom, containing water as the major
component, was replaced by the spectrum shown on the top when the Quality Index calculation
was revised (see discussion in text).

Figure 2 shows four spectra of thiourea. The
spectrum on the top (A) is missing a major peak at
m/z 43 (it appears that this peak has been misidentified
as m/z 42), and has an extra impurity peak at m/z
44 (or 45). That incorrect spectrum was formerly
selected for the database; the "illogical fragmenta- 
tion" algorithm did not recognize the incorrectly 
identified peaks. The spectrum (A) was replaced by 
the revised QI calculation with the spectrum (B) 
shown second, which now has a QI value 18 points 
higher than that of (A). Although spectrum (B) ap- 
pears to be somewhat more complete than spectra 
(C) and (D), it clearly suffers from detector satura- 
tion, and therefore would be considered by an ex- 
pert to be inferior in quality to both spectra (C) and 
(D). Curiously, the bad spectrum (A) receives the 
same QI grade as the good spectrum (C). Since the 
fragmentation of this parent ion does lead to the
formation of an ion of m/z 42, it is unlikely that
any algorithm could have detected the mistake in 
spectrum (A). 

Figure 3 shows two spectra with Quality Index 
values which are within two points of one another. 
The spectrum with the higher QI value contains 
peaks, for example, at masses 41 and 44, which can 
only originate from an impurity. 

An examination of these examples leads to the
conclusion that this type of Quality Index algorithm
could not have done any better at selecting the best
spectrum from among replicates. Even with more fine
tuning, this algorithm as it is presently constituted
will not do any better. In setting up an
evaluation-selection system of highly arbitrary equations,
one is implicitly accepting that some statistical 
fraction of the spectra selected will be spectra 
which are not the best examples available in the 
archive. For instance, the recently-introduced 
Quality Factor, designed to penalize detector saturation,
does so by searching for spectra having one
or more additional peaks similar in magnitude to
the base peak (peak of maximum abundance in a
mass spectrum). Of course, some spectra legitimately
have peaks of such magnitude, and they
will be penalized; other spectra may be significantly
saturated, but still pass such a test. The authors
discuss this problem and conclude that these
errors can be tolerated if the algorithm catches a
large fraction of saturated spectra.

Volume 94, Number 1, January-February 1989
Journal of Research of the National Institute of Standards and Technology

[Figure 2 panels: (A) QI = 702; (B) QI = 720; (C) QI = 702; (D) QI = 627]

Figure 2. Four mass spectra of thiourea. In spectrum (A), m/z 43 has been misidentified as
m/z 42; this is the spectrum originally selected by the QI calculation. Revision of the algorithm
resulted in the choice of spectrum (B), which exhibits detector saturation. Spectra
(C) and (D) (not selected by the program) are better quality spectra than (A) and (B).

Figure 3. Two mass spectra of dichloroacetyl chloride exhibiting Quality Index values which
differ by only 2 points. The lower spectrum, which has the higher QI value, contains peaks,
for example at masses 41 and 44, which can only originate from an impurity.
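
The saturation test described above, and both ways it can go wrong, can be sketched in Python. This is only an illustration under stated assumptions: the function name, the 0.9 cutoff for "similar in magnitude to the base peak", and both example spectra are hypothetical and are not taken from the published Quality Factor. The first example relies on the fact that bromine has two isotopes of nearly equal abundance, so an unsaturated HBr spectrum legitimately shows two peaks of similar height.

```python
# Hypothetical sketch; the cutoff and the example abundances are assumptions.
def near_base_peaks(spectrum, fraction=0.9):
    """Count peaks, other than the base peak, with abundance >= fraction * base peak."""
    base = max(spectrum.values())
    count = sum(1 for abundance in spectrum.values() if abundance >= fraction * base)
    return count - 1  # do not count the base peak itself


def flagged_as_saturated(spectrum, fraction=0.9):
    """Apply the test: one or more additional peaks close to the base peak."""
    return near_base_peaks(spectrum, fraction) >= 1


# Illustrative HBr-like spectrum: m/z 80 and 82 are similar because of the Br isotopes.
hbr_like = {79: 45.0, 80: 100.0, 81: 44.0, 82: 97.0}

# Illustrative spectrum in which only the base peak could have been clipped.
single_tall_peak = {15: 20.0, 43: 100.0, 58: 60.0}

false_alarm = flagged_as_saturated(hbr_like)     # penalized although legitimate
missed = flagged_as_saturated(single_tall_peak)  # passes even if the base peak saturated
```

The sketch shows why a test of this form is statistical in character: it cannot distinguish a legitimate doublet from a clipped one, and it cannot see saturation that affects only a single peak.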

Until a truly "expert system" approach to the 
evaluation of analytical mass spectra is devised, it 
appears that the only possible procedure for select- 
ing only the best available spectrum of each com- 
pound from an archive is to (1) use the existing 
Quality Index calculation as a rough first selection 
procedure, and (2) have an expert carry out a vi- 
sual selection from among those replicate sets for 
which the Quality Index values are within 200-300 
points of one another. This is the procedure now 
being carried out on the NIST/EPA/MSDC Mass 
Spectral Database, preparatory to release of the 
next update. 
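
The two-step procedure can be outlined as a short Python sketch. The function name, the data layout, and the reading of "within 200-300 points of one another" as "within a fixed window of the top-ranked replicate" are assumptions made for illustration; the thiourea QI values are those given in figure 2, and the second compound is entirely hypothetical. Each compound is assumed to have at least one spectrum.

```python
# Hypothetical sketch of QI-based preselection followed by expert review.
def select_spectra(replicates, review_window=200):
    """replicates maps compound -> list of (spectrum_id, qi) with at least one entry.

    Returns (selected, needs_review):
      selected      maps compound -> id of the highest-QI spectrum (step 1)
      needs_review  maps compound -> ids within review_window of the best QI,
                    listed only when more than one spectrum qualifies (step 2)
    """
    selected = {}
    needs_review = {}
    for compound, entries in replicates.items():
        ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
        best_id, best_qi = ranked[0]
        selected[compound] = best_id
        close = [sid for sid, qi in ranked if best_qi - qi <= review_window]
        if len(close) > 1:
            needs_review[compound] = close  # an expert chooses among these visually
    return selected, needs_review


archive = {
    "thiourea": [("A", 702), ("B", 720), ("C", 702), ("D", 627)],  # QI values from figure 2
    "compound X": [("x1", 900), ("x2", 400)],                      # hypothetical
}
selected, needs_review = select_spectra(archive)
```

With a 200-point window all four thiourea spectra are referred to the expert, who could then prefer (C) or (D) over the automatically selected (B); the hypothetical second compound is settled by the Quality Index alone.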